DETAILED ACTION



Response to Arguments 



1. Applicant's arguments have been fully considered but they are not persuasive.

As per claim 1, Applicant argues that Page fails to describe or suggest "training the digital voice library to associate each syllable speech item with a literal text syllable of the particular syllable speech item," a feature that, under the principles of inherency, would have to be necessarily present in Page. (Amendment, p. 3) However, Applicant acknowledges that the "diphone dictionary in Page needs to be trained." (Amendment, p. 3) Therefore, by Applicant's own admission, Page inherently requires training of the voice library (dictionary). Applicant further argues that it is not inherent that the training of the library occurs as recited in claim 1. However, claim 1 merely recites associating each syllable speech item with a literal text syllable of the particular syllable speech item. Since any text-to-speech library training would create some mapping between syllable items (diphones, etc.) and the text required for output, claim 1 does not recite any specific limitation that would make this claim patentable. While Applicant's response does not explicitly specify the "claimed technique" that would make this claim patentable, it is noted that the features upon which Applicant appears to rely are not recited in the rejected claim. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).
As per claims 5-11, Applicant argues that the added references do not overcome the deficiencies of Page. However, in light of the explanation of inherency in the preceding paragraph, Applicant's arguments on this ground are moot.

Furthermore, in response to Applicant's arguments that there is no suggestion to combine the references for claims 5-11, the examiner recognizes that obviousness can only be established by combining or modifying the teachings of the prior art to produce the claimed invention where there is some teaching, suggestion, or motivation to do so found either in the references themselves or in the knowledge generally available to one of ordinary skill in the art. See In re Fine, 837 F.2d 1071, 5 USPQ2d 1596 (Fed. Cir. 1988) and In re Jones, 958 F.2d 347, 21 USPQ2d 1941 (Fed. Cir. 1992). In this case, the motivation is found in the knowledge generally available to one of ordinary skill in the art, as explained below and in the rejections.

As per claim 5, training dictionaries using neural networks is a well-known technique in the art of speech processing. Karalli et al. disclose a Text-To-Speech (TTS) system similar to the one taught by Page. As discussed in the rejection of claim 1, Page's TTS system would necessarily require training; as a result, Karalli's method of training a TTS system would be an efficient solution to the training problem. For example, Karalli provides the following efficiency motivation for using neural networks to train TTS systems at Col. 1, lines 45-47: "Neural networks overcome large storage requirements of concatenative and synthesis-by rule systems, since the knowledge base is stored in the weights rather than in memory."
As per claim 6, the Examiner has incorporated a reference (Walker) into the body of the rejection instead of relying on official notice.

As per claims 7-10, Applicant does not specify the reasons for his belief in the lack of motivation to combine Page and Lin et al. As it stands, there is sufficient motivation in the knowledge generally available to one of ordinary skill in the art to combine Page and Lin et al. Page discloses synthesis of the unknown portion of the message (B, FIG. 2) and, as a result, would necessarily require parsing of the textual representation of the unknown word into a sequence of syllables (see the breakdown of the word "Three" in B, FIG. 2). While Page does not explicitly disclose the specifics of unknown-word parsing, his system would require some efficient method of parsing such words. Lin et al. disclose exactly such techniques and, as a result, would have been an obvious choice to one skilled in the art for supplementing Page's system, allowing the system to synthesize speech without having to store all the words in advance. This would increase flexibility and reduce storage requirements for the system, as it would not have to pre-store all the words (less storage) and could handle arbitrary words and acronyms (flexibility).

As per claim 11, Applicant does not specify the reasons for his belief in the lack of motivation to combine Page and Carter et al. However, the motivation can be found in the knowledge generally available to one of ordinary skill in the art. Here, the concept of "caching" data to speed up future references is notoriously well known in the art as a way to improve system performance by storing frequently used data in faster memory or in pre-parsed data structures (such as the digital voice library), thus minimizing search times and I/O delays for database access. Therefore, it would have been obvious for one skilled in the art to combine Page and Carter et al. to improve the efficiency of the system in situations where the same words often reappear in the input and, as a result, avoid duplicate processing of unknown words each time these words are encountered.



Claim Rejections - 35 USC § 102 

2. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(e) the invention was described in (1) an application for patent, published under section 122(b), by 
another filed in the United States before the invention by the Applicant for patent or (2) a patent 
granted on an application for patent by another filed in the United States before the invention by the 
Applicant for patent, except that an international application filed under the treaty defined in section 
351(a) shall have the effects for purposes of this subsection of an application filed in the United States 
only if the international application designated the United States and was published under Article 21(2) 
of such treaty in the English language. 

3. Claims 1-4 are rejected under 35 U.S.C. 102(e) as being anticipated by Page et al. (6,175,821). The table below summarizes the limitations of the claims of this application and the parts of Page et al. that "read on" these limitations.



Claim 1
Limitations: A method for converting text to concatenated voice by utilizing a digital voice library and a set of playback rules, the digital voice library including a plurality of speech items including words and syllables and a corresponding plurality of voice recordings wherein each speech item corresponds to at least one available voice recording, the method comprising: training the digital voice library to associate each syllable speech item with a literal text syllable of the particular syllable speech item.
Page et al.: The system contains a ROM (3, FIG. 1) that stores recordings of phrases used for message outputs. In addition, the speech converter (4, FIG. 1) has a diphone dictionary for converting text to speech. Inherently, for speech synthesis, this dictionary has to be trained (or initially populated) in order to create a mapping between text syllables and diphones.

Claim 2
Limitations: The method of claim 1 further comprising: receiving a sequence of words including known words that correspond to word speech items in the digital voice library and including unknown words; converting each known word into a word speech item in accordance with the digital voice library; and for each unknown word, parsing the unknown word to determine a sequence of literal text syllables and converting the text syllable sequence to a sequence of syllable speech items in accordance with the digital voice library.
Page et al.: The system receives a text message (Col. 4, lines 60-63), then synthesizes the message using the diphone dictionary of the speech synthesizer (Col. 4, lines 63-66). In addition, invariable (known) portions of the text message are converted directly to preset recordings by the message generator (Col. 5, lines 42-45).

Claim 3
Limitations: The method of claim 2 further comprising: converting the sequence of word speech items and syllable speech items into a sequence of voice recordings in accordance with the set of playback rules.
Page et al.: The variable and invariable portions are pre-processed in order to produce a natural-sounding message (Col. 5, lines 36-45).

Claim 4
Limitations: The method of claim 3 further comprising: generating voice data based on the sequence of voice recordings by concatenating adjacent recordings in the sequence of voice recordings.
Page et al.: The variable and invariable portions of the message are concatenated together into a unified recording by the message generator (Col. 5, lines 45-49).
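For illustration only, the flow of claims 2-4 as mapped above (known words converted to word speech items, unknown words parsed into literal text syllables and converted to syllable speech items, then adjacent recordings concatenated) can be sketched in Python. This is a hypothetical sketch under stated assumptions, not the implementation of Page et al. or of Applicant: the library contents, the byte-string stand-ins for voice recordings, the greedy longest-match parser, and the names `WORD_LIBRARY`, `SYLLABLE_LIBRARY`, `parse_syllables`, and `text_to_voice` are all assumptions, and the playback rules of claim 3 are not modeled.

```python
# Hypothetical digital voice library: speech items mapped to stand-in "recordings".
WORD_LIBRARY = {"call": b"<call>", "from": b"<from>"}
SYLLABLE_LIBRARY = {"thr": b"<thr>", "ee": b"<ee>", "t": b"<t>", "en": b"<en>"}


def parse_syllables(word):
    """Assumed parser: greedy longest match of `word` against known text syllables."""
    syllables, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a syllable matches.
        j = next((j for j in range(len(word), i, -1) if word[i:j] in SYLLABLE_LIBRARY), None)
        if j is None:
            raise ValueError(f"no syllable speech item matches {word[i:]!r}")
        syllables.append(word[i:j])
        i = j
    return syllables


def text_to_voice(text):
    """Convert text to concatenated 'voice data' (playback rules not modeled)."""
    recordings = []
    for word in text.lower().split():
        if word in WORD_LIBRARY:
            # Known word: use its word speech item directly.
            recordings.append(WORD_LIBRARY[word])
        else:
            # Unknown word: parse into literal text syllables, then map each one.
            recordings.extend(SYLLABLE_LIBRARY[s] for s in parse_syllables(word))
    # Concatenate adjacent recordings into a single output.
    return b"".join(recordings)
```

For example, `text_to_voice("call from three")` returns `b"<call><from><thr><ee>"`: the first two words are found whole, and "three" is assembled from two syllable recordings.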
Claim Rejections - 35 USC § 103



4. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

5. Claim 5 is rejected under 35 U.S.C. 103(a) as being obvious over Page et al. in 
view of Karalli et al. (5,668,926). 

As per claim 5, Page et al. discloses a speech converter that has a diphone 
dictionary for converting text to speech (4, FIG. 1).

Page et al. do not disclose training the dictionary by "utilizing a neural network 
having an input and an output to train the digital voice library with the neural network 
receiving the literal text syllable of the particular syllable speech item as input and with 
the neural network outputting the associated syllable speech item." 

Karalli et al. teach the use of neural networks to train the text-to-speech system 
(Col. 2, lines 21-33). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to modify Page et al. as taught in Karalli et al., in order to populate 
the diphone dictionary in an efficient manner and also provide an effective method of
resolving ambiguous inputs to the dictionary. 



6. Claim 6 is rejected under 35 U.S.C. 103(a) as being obvious over Page et al. in
view of Walker (6,510,413).

Page et al. do not disclose training the digital library by "manually associating 
each syllable speech item with the literal text syllable of the particular syllable speech 
item." 

The process of manually populating any look-up table (or dictionary) is similar to the process of entering words in a bilingual dictionary (for example, English-Spanish). In that case, an editor or writer manually creates a mapping between each English word and its Spanish translation. Similar mappings are also used in the computer arts. For example, the "hosts" file on the Windows operating system allows the user to manually enter mappings between IP addresses and host names. Other examples in the computer arts abound (such as address books). Therefore, manually adding entries to tables or dictionaries of various information is by no means an original concept and is well known in many arts, including computer hardware and software.

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to modify Page et al. to manually associate each literal text syllable 
with the corresponding syllable speech item since this would be the most 
straightforward and "brute force" method of training the dictionary. 

7. Claims 7-10 are rejected under 35 U.S.C. 103(a) as being obvious over Page et 
al. in view of Lin et al. (6,076,060).



As per claim 7, Page et al. discloses a speech converter that has a diphone
dictionary for converting text to speech (4, FIG. 1).

Page et al. do not disclose "parsing the unknown word to determine a sequence 
of literal text syllables and known words, and converting the sequence to a sequence of 
syllable speech items and word speech items in accordance with the digital voice 
library."

Lin et al. teach parsing the unknown word into a sequence of syllables and word
speech items (Col. 6, lines 56-60) that are later converted to speech sounds (16, FIG. 2).

It would have been obvious to one of ordinary skill in the art at the time the invention was made to modify Page et al. as taught in Lin et al., in order to eventually create a diphone representation of each unknown word so that it could be synthesized by a speech synthesizer that requires diphones as input to produce the output sound.

As per claim 8, Page et al. do not disclose parsing that comprises:

• parsing the unknown word in the forward direction to determine any known words;

• parsing the unknown word in the reverse direction to determine any known words;

• where any known words overlap, selecting the larger word;

• parsing the unknown word in the forward direction to determine any literal text syllables; and

• parsing the unknown word in the reverse direction to determine any literal text syllables.

Lin et al. teach parsing words from left to right and from right to left in order to determine sub-words and literal text symbols (Col. 3, lines 45-53). Also, the larger words are chosen first (Col. 3, lines 55-58).
It would have been obvious to one of ordinary skill in the art at the time the invention was made to modify Page et al. as taught in Lin et al., in order to create an efficient parsing technique that more closely matches the way words are parsed when spoken by humans. This method of parsing is less likely to miss important sub-strings in unknown words.
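The two-direction parsing recited in claim 8 and attributed to Lin et al. can be made concrete with bidirectional maximum matching, a widely known segmentation technique substituted here purely for illustration. The sketch below is an assumption-laden example, not the procedure of Lin et al., Page et al., or Applicant: it treats known words and literal text syllables as a single vocabulary, falls back to single characters where nothing matches, and selects whichever pass yields fewer (and therefore larger) units. The function names and the example vocabulary are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Left to right: at each position take the longest vocabulary entry,
    falling back to a single unmatched character."""
    segments, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                segments.append(text[i:j])
                i = j
                break
    return segments


def backward_max_match(text, vocab, max_len):
    """Right to left: at each end position take the longest vocabulary entry,
    falling back to a single unmatched character."""
    segments, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in vocab or i == j - 1:
                segments.insert(0, text[i:j])
                j = i
                break
    return segments


def bidirectional_parse(text, vocab):
    """Run both passes and keep the one with fewer (hence larger) units;
    ties go to fewer unmatched pieces, then to the forward pass."""
    max_len = max(len(v) for v in vocab)
    forward = forward_max_match(text, vocab, max_len)
    backward = backward_max_match(text, vocab, max_len)

    def cost(segments):
        return (len(segments), sum(s not in vocab for s in segments))

    return min((forward, backward), key=cost)
```

With the hypothetical vocabulary `{"not", "no", "table"}`, the forward pass on "notable" yields `["not", "a", "b", "l", "e"]`, while the reverse pass yields `["no", "table"]`, which is selected because it has fewer, larger units.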

As per claims 9 and 10, Page et al. discloses the calculation and adjustment of
pitch of the generated message using transition signals and appropriate voice
recordings (Col. 2, lines 32-48).

8. Claim 11 is rejected under 35 U.S.C. 103(a) as being obvious over Page et al. in
view of Carter et al. (6,600,814).

Page does not disclose "for each unknown word, after the unknown word is 
parsed, storing results of the parsing in the digital voice library so that a next encounter 
with the same unknown word may be handled more efficiently." 

Carter et al. teach storing processed portions of text in the text-to-speech
system to alleviate the load on the system (Col. 2, lines 30-39).

It would have been obvious to one of ordinary skill in the art at the time the invention was made to modify Page et al. as taught by Carter et al. to store the parsed results of unknown words so that subsequent encounters with the same words would be handled more efficiently. This concept of "caching" data for future reference is extremely well known and widely used in the art of computing.
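The caching rationale relied on for claim 11 can be illustrated with a minimal Python sketch, assuming a library object that stores the parse of each unknown word so that a repeat encounter skips parsing. The class `VoiceLibrary`, the `greedy_parse` helper, and the `parse_calls` counter are hypothetical names introduced for illustration; they are not drawn from Page et al., Carter et al., or the application.

```python
def greedy_parse(word, syllables):
    """Hypothetical parser: greedy longest match; unmatched characters
    become single-character pieces."""
    pieces, i = [], 0
    while i < len(word):
        j = next((j for j in range(len(word), i, -1) if word[i:j] in syllables), i + 1)
        pieces.append(word[i:j])
        i = j
    return tuple(pieces)


class VoiceLibrary:
    """Hypothetical digital voice library that stores parse results of unknown words."""

    def __init__(self, syllables, parser=greedy_parse):
        self.syllables = set(syllables)
        self.parser = parser
        self.parsed_words = {}  # cache: unknown word -> tuple of literal text syllables
        self.parse_calls = 0    # counts how often the parser actually runs

    def syllables_for(self, word):
        # Parse only on the first encounter; later encounters reuse the stored result.
        if word not in self.parsed_words:
            self.parse_calls += 1
            self.parsed_words[word] = self.parser(word, self.syllables)
        return self.parsed_words[word]
```

A plain dictionary held by the library object is used here rather than a general-purpose memoization decorator so that the stored results live in the library itself, mirroring the claim language of storing the parsing results in the digital voice library.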
Conclusion



9. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time
policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the mailing date of this final action. 

10. Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Dmitry Brant whose telephone number is (703) 305-8954.
The examiner can normally be reached on Mon. - Fri. (8:30am - 5pm).

If attempts to reach the examiner by telephone are unsuccessful, the examiner's
supervisor, Talivaldis Ivars Smits, can be reached on (703) 306-3011. The fax phone
number for the organization where this application or proceeding is assigned is (703) 872-9306.
Any inquiry of a general nature or relating to the status of this application or
proceeding should be directed to the Tech Center 2600 receptionist whose telephone
number is (703) 305-4700.